How to Use Boosting for Tumor Classification with Gene Expression Data
Authors
Abstract
Motivation: Microarray experiments generate large datasets with expression values for thousands of genes but no more than a few dozen samples. Accurate supervised classification of tissue samples in such high-dimensional problems is difficult but often crucial for successful diagnosis and treatment. A promising way to meet this challenge is to use boosting in conjunction with decision trees.
Results: We demonstrate that the generic boosting algorithm needs some modifications to become an accurate classifier in the context of gene expression data. In particular, we present a feature preselection method, a more robust boosting procedure and a new approach for multi-categorical problems. These modifications yield slight to drastic increases in performance and competitive results on several publicly available datasets.
Availability: Software for the modified boosting algorithms as well as for decision trees is available for free in R at http://stat.ethz.ch/~dettling/boosting.
Contact: [email protected]
†Corresponding author. Tel.: +41-1-632-6683; fax: +41-1-632-1228; e-mail: [email protected]
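To make the idea of "feature preselection followed by boosting with decision trees" concrete, here is a minimal sketch in plain Python. It is not the authors' method: it uses generic AdaBoost with one-split trees (stumps), and the preselection score (absolute difference of class means per gene) is an assumption chosen only for illustration. The function names `preselect`, `fit_stump`, `adaboost` and `predict`, and the toy data, are hypothetical. The abstract does not specify the paper's preselection score, robust boosting procedure or multi-class approach, so they are not reproduced here.

```python
import math

def preselect(X, y, k):
    """Keep the k genes with the largest absolute difference of class means.
    Assumption: this score is illustrative, not the paper's preselection rule."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == -1]
    def score(j):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        return abs(mean_pos - mean_neg)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def fit_stump(X, y, w, features):
    """Return (weighted error, gene, threshold, sign) of the best one-split tree,
    or None if no candidate split exists."""
    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[j] > t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, features, rounds):
    """Generic AdaBoost with stumps; labels must be -1 or +1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w, features)
        if stump is None:
            break
        err, j, t, sign = stump
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, t, sign))
        # Up-weight misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * (sign if row[j] > t else -sign))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    """Sign of the weighted vote of all stumps."""
    vote = sum(alpha * (sign if x[j] > t else -sign) for alpha, j, t, sign in model)
    return 1 if vote >= 0 else -1

# Toy data: 6 samples, 4 genes; only gene 2 separates the two classes.
X = [[0.1, 5.0, 2.0, 1.0], [0.2, 4.0, 2.5, 0.9], [0.0, 6.0, 2.2, 1.1],
     [0.1, 5.5, 7.0, 1.0], [0.3, 4.5, 7.5, 0.8], [0.2, 5.2, 8.0, 1.2]]
y = [-1, -1, -1, 1, 1, 1]

selected = preselect(X, y, k=2)
model = adaboost(X, y, selected, rounds=5)
train_predictions = [predict(model, row) for row in X]
```

In a real microarray setting the preselection step would need to run inside each cross-validation fold; selecting genes on the full dataset before estimating error biases the estimate downward.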
Similar resources
Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine
DNA microarray gene expression data provide a wealth of information across thousands of variables (genes). Analysis of this information can reveal the genetic causes of disease and the differences between tumors. In this study we use a statistical method to reduce the high-dimensional data, selecting high-impact genes as biomarkers, and then classify ovarian tumors based on gene expression data of...
Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest
Background & objective: Microarray and next generation sequencing (NGS) data are important sources for finding useful molecular patterns. The large volume of gene expression data also makes it harder to identify the biomarkers associated with cancer. The random forest (RF) is used to effectively analyze the problems of large-p and smal...
Study of HMGA2 Gene Inhibition with Specific shRNA and siRNA and Investigation of Corresponding Effects on Downstream Gene Expression in MDA-MB-231 Cancer Cells: A Bioinformatic and Experimental Study
Background & Aims: The use of siRNA to silence gene expression is expanding rapidly. The aim of this study is to investigate, bioinformatically and experimentally, the inhibition of the HMGA2 gene and its effects on the expression of downstream genes in MDA-MB-231 cancer cells treated with shRNA and siRNA specific to HMGA2. Materials & Methods: To perform this bioinformatic a...
Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in the prevention, diagnosis and treatment of diseases at the genomic level. In this paper, fast global k-means (fast GKM) is developed for clustering gene expression datasets. Fast GKM is a significant improvement on the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
Evaluation of PRR11 gene expression changes and its relationship with tumor size in patients with gastric adenocarcinoma
Introduction: Gastric cancer is one of the most common neoplasms of the gastrointestinal tract. Because of its invasiveness and its nonspecific signs and symptoms, the disease is often diagnosed at an advanced stage, when survival is short. PRR11 participates in the initiation and progression of lung cancer and breast cancer by regulating important genes involved in the cell cycle and tumorigenesis. In this researc...